Linear Regression

Randy Johnson

1/7/2019

Overview

Least Squares Intuition

Sand flea data from McDonald, J.H. 1989. Selection component analysis of the Mpi locus in the amphipod Platorchestia platensis. Heredity 62: 243-249.

See the Handbook of Biological Statistics for additional examples.

Least Squares Intuition

Model Notation

eggs = β0 + β1 mg + ε

Model Output

eggs = β0 + β1 mg + ε

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  12.6890     4.2009   3.021   0.0056 **
## mg            1.6017     0.6176   2.593   0.0154 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

eggs = 12.7 + 1.6*mg + ε
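For reference, output like the above comes from R's `lm()`. A minimal sketch with simulated data (the column names `eggs` and `mg` follow the slides; the values are made up to mimic the fitted equation, not McDonald's data):

```r
set.seed(42)
# simulated stand-in for the amphipod data: egg count vs. dry weight (mg)
fleas <- data.frame(mg = runif(28, 4, 10))
fleas$eggs <- 12.7 + 1.6 * fleas$mg + rnorm(28, sd = 3)

mdl <- lm(eggs ~ mg, data = fleas)  # least-squares fit of eggs = b0 + b1*mg + e
summary(mdl)$coefficients           # estimates, std. errors, t values, p-values
```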

Assumption 1: Error terms are normally distributed

Two ways to test this assumption:

## 
##  Shapiro-Wilk normality test
## 
## data:  .std.resid
## W = 0.93555, p-value = 0.08517

See my code and ?shapiro.test for more details.
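A sketch of this check (Shapiro-Wilk on the standardized residuals), using the built-in `cars` dataset as a stand-in for the flea data:

```r
mdl <- lm(dist ~ speed, data = cars)  # any fitted lm works here
std_resid <- rstandard(mdl)           # standardized residuals
shapiro.test(std_resid)               # H0: the residuals are normally distributed
```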

Assumption 1: Error terms are normally distributed

## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferroni p
## 16 3.425071          0.0021295     0.059625

See my code and ?outlierTest from the car package for more details.
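`car::outlierTest()` is essentially a Bonferroni-adjusted t-test of the largest studentized residual; a base-R sketch of the same computation (again using `cars` as a stand-in model):

```r
mdl <- lm(dist ~ speed, data = cars)
rs <- rstudent(mdl)                 # studentized residuals
i  <- which.max(abs(rs))            # most extreme observation
p  <- 2 * pt(abs(rs[[i]]), df = df.residual(mdl) - 1, lower.tail = FALSE)
c(rstudent = rs[[i]],               # the largest studentized residual
  unadjusted_p = p,                 # its two-sided p-value
  bonferroni_p = min(1, p * nobs(mdl)))  # adjusted for testing n residuals
```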

Influence

We see that our borderline-significant outlier (#16) has very little influence on the coefficients, though it does change the error terms.

* The trend line changes very little if we exclude this data point from the analysis.
* Including it shifts the line slightly upward, which gives slightly larger fitted values for all data points.

The 24th data point is also highlighted here. It has a large residual (though not large enough to be flagged as an outlier) and sits at the edge of the predictor's distribution, which gives it increased leverage and therefore influence.

See my code and ?influencePlot from the car package for more details.
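`car::influencePlot()` is graphical, but the quantities it plots can be computed directly; a sketch using `cars` as a stand-in:

```r
mdl <- lm(dist ~ speed, data = cars)
lev  <- hatvalues(mdl)       # leverage: distance from the center of the predictors
infl <- cooks.distance(mdl)  # influence: how much each point moves the coefficients
which(lev > 2 * mean(lev))   # a common rule-of-thumb leverage cutoff
head(sort(infl, decreasing = TRUE), 3)  # most influential observations
```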

Assumption 1: Error terms are normally distributed

## [1]  1 16

We see our borderline outlier in the top right corner is larger than we would expect it to be, but not quite outside the confidence region within which we expect our error terms to fall.

See my code and ?qqPlot from the car package for more details.

Assumption 2: Linear Relationship

size = β0 + β1 length + ε

size = β0 + β1 length + β2 length² + ε

Data from Ashton, K.G., R.L. Burke, and J.N. Layne. 2007. Geographic variation in body and clutch size of gopher tortoises. Copeia 2007: 355-363.

Assumption 2: Linear Relationship

size = β0 + β1 length + ε

## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.59974   19.53898   0.031    0.976
## length       0.02435    0.06316   0.386    0.706

size = β0 + β1 length + β2 length² + ε

## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -9.634e+02  2.958e+02  -3.257  0.00625 **
## length       6.264e+00  1.913e+00   3.275  0.00603 **
## length2     -1.008e-02  3.088e-03  -3.263  0.00617 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
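The quadratic model can be fit by wrapping the squared term in `I()`. A sketch with simulated tortoise-like data (the coefficients below are made up to mimic the output above, not Ashton et al.'s data):

```r
set.seed(7)
tortoise <- data.frame(length = runif(40, 220, 330))
tortoise$size <- -963 + 6.26 * tortoise$length -
    0.0101 * tortoise$length^2 + rnorm(40, sd = 1)

m1 <- lm(size ~ length, data = tortoise)                # linear only
m2 <- lm(size ~ length + I(length^2), data = tortoise)  # adds the squared term
anova(m1, m2)  # does the quadratic term significantly improve the fit?
```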

Component Residual Plots

size = β0 + β1 length + ε

Component Residual Plots

size = β0 + β1 length + β2 length2 + ε

Assumption 2: Linear Relationship

How might you analyze data with a non-linear relationship?

Assumption 3: No/Little Multicollinearity

## # A tibble: 569 x 31
##    radius_mean texture_mean perimeter_mean area_mean smoothness_mean
##          <dbl>        <dbl>          <dbl>     <dbl>           <dbl>
##  1        18.0         10.4          123.      1001           0.118 
##  2        20.6         17.8          133.      1326           0.0847
##  3        19.7         21.2          130       1203           0.110 
##  4        11.4         20.4           77.6      386.          0.142 
##  5        20.3         14.3          135.      1297           0.100 
##  6        12.4         15.7           82.6      477.          0.128 
##  7        18.2         20.0          120.      1040           0.0946
##  8        13.7         20.8           90.2      578.          0.119 
##  9        13           21.8           87.5      520.          0.127 
## 10        12.5         24.0           84.0      476.          0.119 
## # ... with 559 more rows, and 26 more variables: compactness_mean <dbl>,
## #   concavity_mean <dbl>, `concave points_mean` <dbl>,
## #   symmetry_mean <dbl>, fractal_dimension_mean <dbl>, radius_se <dbl>,
## #   texture_se <dbl>, perimeter_se <dbl>, area_se <dbl>,
## #   smoothness_se <dbl>, compactness_se <dbl>, concavity_se <dbl>,
## #   `concave points_se` <dbl>, symmetry_se <dbl>,
## #   fractal_dimension_se <dbl>, radius_worst <dbl>, texture_worst <dbl>,
## #   perimeter_worst <dbl>, area_worst <dbl>, smoothness_worst <dbl>,
## #   compactness_worst <dbl>, concavity_worst <dbl>, `concave
## #   points_worst` <dbl>, symmetry_worst <dbl>,
## #   fractal_dimension_worst <dbl>, metastatic <lgl>

Breast cancer data from the UCI Machine Learning Repository.

See this Kaggle notebook by Mike M. Lee for a more thorough exploration of this dataset.

Assumption 3: No/Little Multicollinearity

# check for multicollinearity; vif() comes from the car package
glm(metastatic ~ radius_mean + perimeter_mean + area_mean + texture_mean +
        smoothness_mean + concavity_mean, data = bc) %>%
    vif()
##     radius_mean  perimeter_mean       area_mean    texture_mean 
##      829.419705      873.476620       40.812258        1.176198 
## smoothness_mean  concavity_mean 
##        1.644838        6.850699

Assumption 3: No/Little Multicollinearity

Options:
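Typical options include dropping one of the correlated predictors, combining them (e.g., with PCA), or using a penalized regression. A VIF is just 1/(1 − R²) from regressing one predictor on the others, so dropping a near-duplicate brings it back toward 1; a base-R sketch with simulated data:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(200), x3 = rnorm(200))
d$x2 <- d$x1 + rnorm(200, sd = 0.05)  # x2 is nearly a copy of x1

# VIF for one predictor = 1 / (1 - R^2) of regressing it on the others
vif1 <- function(f, data) 1 / (1 - summary(lm(f, data))$r.squared)
vif1(x1 ~ x2 + x3, d)  # huge: x1 is almost fully explained by x2
vif1(x1 ~ x3, d)       # near 1 after dropping the near-duplicate x2
```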

Assumption 4: No Autocorrelation

Data on measles incidence in Baltimore were collected from published reports by various people and hosted by Ben Bolker at McMaster University.

Assumption 4: No Autocorrelation

# durbinWatsonTest() comes from the car package
lm(Incidence ~ dt, data = measles) %>%
    durbinWatsonTest()
##  lag Autocorrelation D-W Statistic p-value
##    1       0.7823128     0.4342107       0
##  Alternative hypothesis: rho != 0

Assumption 5: Homoscedasticity

These are partial census data for females in the Dobe area !Kung San, compiled from interviews conducted by Nancy Howell in the late 1960s.

Assumption 5: Homoscedasticity

## Warning in spreadLevelPlot.lm(mdl1): 
## 7 negative fitted values removed

## 
## Suggested power transformation:  1.039703
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.110857, Df = 1, p = 0.2919
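`car::ncvTest()` is a score test of constant variance against variance that changes with the fitted values; a base-R sketch of the same statistic (using `cars` as a stand-in model):

```r
mdl <- lm(dist ~ speed, data = cars)
e <- residuals(mdl)
u <- e^2 / mean(e^2)               # scaled squared residuals
aux <- lm(u ~ fitted(mdl))         # do they trend with the fitted values?
chisq <- sum((fitted(aux) - mean(u))^2) / 2   # score statistic, 1 df
pchisq(chisq, df = 1, lower.tail = FALSE)     # small p => heteroscedasticity
```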

Assumption 5: Homoscedasticity

How might you model heteroscedastic data?
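One standard answer (beyond the transformations covered next) is weighted least squares: give high-variance observations less weight. A sketch with simulated data whose noise grows with x:

```r
set.seed(3)
x <- runif(120, 1, 10)
y <- 2 + 0.5 * x + rnorm(120, sd = 0.3 * x)  # error sd proportional to x

ols <- lm(y ~ x)                     # ordinary least squares: unbiased but inefficient
wls <- lm(y ~ x, weights = 1 / x^2)  # weight by inverse variance
rbind(ols = coef(ols), wls = coef(wls))
```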

Transformations

Many of the problems we encounter in linear regression stem from model choice.

The two most common transformations: power and log.

Power Transformations

Original model: y ~ β0 + β1x + ε

With power transformation: y ~ β0 + β1x² + ε

Power Transformations

Original model: y ~ β0 + β1x + ε

With power transformation: y ~ β0 + β1sqrt(x) + ε

Log Transformations

Original model: y ~ β0 + β1x + ε

With log transformation: y ~ β0 + β1ln(x) + ε

Log Transformations

Original model: y ~ β0 + β1x + ε

With log transformation: ln(y) ~ β0 + β1x + ε
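In R these are just formula-level transformations. A sketch of the ln(y) case with simulated multiplicative-error data:

```r
set.seed(9)
x <- runif(80, 1, 5)
y <- exp(0.7 + 0.4 * x) * exp(rnorm(80, sd = 0.1))  # multiplicative error

m_log <- lm(log(y) ~ x)  # fits ln(y) = b0 + b1*x + e
coef(m_log)              # estimates should be near 0.7 and 0.4
```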

Assumptions Summary

| Assumption | Cause | Consequence | Diagnosis |
|---|---|---|---|
| Linear Relationship | Bad model | Inaccurate predictions | car::crPlots() |
| Multivariate Normality | Bad model / Noisy data | Incorrect statistics (p-values / CIs) | car::qqPlot(), shapiro.test(residuals) |
| No/Little Multicollinearity | Correlated variables | Unstable model coefficients | car::vif() |
| No Autocorrelation | Non-independent data | Inefficient estimators | car::durbinWatsonTest() |
| Homoscedasticity | Bad model / "Bad" data | Incorrect statistics (p-values / CIs) | car::ncvTest(), car::spreadLevelPlot() |

Thanks

Statistics for Lunch Team

* Greg Alvord
* Eckart Bindewald
* Taina Immonen
* Brian Luke
* George Nelson
* Ravi Ravichandran
* Tom Schneider

See https://github.com/abcsFrederick/LinearRegression for code.